Parallelization of Maximum Entropy POS Tagging for Bahasa Indonesia with MapReduce

Authors

Arif Nurwidyantoro

Edi Winarko

Abstract

In this paper, the MapReduce programming model is used to parallelize the training and tagging processes in maximum entropy part-of-speech tagging for Bahasa Indonesia. In the training process, MapReduce is applied to dictionary, tagtoken, and feature creation. In the tagging process, MapReduce is used to tag the lines of a document in parallel. The training experiments showed that training with MapReduce is faster, but the time spent reading its results inside the process slows down the total training time. The tagging experiments, run with different numbers of map and reduce processes, showed that the MapReduce implementation could speed up the tagging process. The fastest tagging result was obtained with a 1,000,000-word corpus and 30 map processes.
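As a rough illustration of the dictionary-creation step mentioned in the abstract, the following is a minimal in-process sketch of a map phase and a reduce phase. It does not reproduce the authors' implementation and does not use Hadoop. The `word/TAG` input format, the tag names, and the function names `map_line`, `reduce_counts`, and `build_dictionary` are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_line(line):
    """Map step: emit ((word, tag), 1) for each 'word/TAG' token in a line.

    The 'word/TAG' token format is an assumption for this sketch.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep and word and tag:  # skip tokens without a usable tag
            pairs.append(((word.lower(), tag), 1))
    return pairs


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (word, tag) key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def build_dictionary(lines):
    """Run map over every line, reduce, then group tag counts by word."""
    mapped = chain.from_iterable(map_line(line) for line in lines)
    counts = reduce_counts(mapped)
    dictionary = defaultdict(dict)
    for (word, tag), n in counts.items():
        dictionary[word][tag] = n
    return dict(dictionary)


# Hypothetical two-line tagged corpus; the tags are illustrative only.
corpus = ["saya/PRP makan/VB nasi/NN", "nasi/NN itu/DT enak/JJ"]
tag_dictionary = build_dictionary(corpus)
```

Because each line is mapped independently, the map calls could in principle be distributed across workers, which is the property a MapReduce framework exploits.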


Similar resources

Probabilistic Part Of Speech Tagging for Bahasa Indonesia

In this paper we report our work in developing Part of Speech Tagging for Bahasa Indonesia using probabilistic approaches. We use Conditional Random Fields (CRF) and Maximum Entropy methods in assigning the tag to a word. We use two tagsets containing 37 and 25 part-of-speech tags for Bahasa Indonesia. In this work we compared both methods using two different corpora. The results of the ex...


NING MA et al: FUSION OF WORD CLUSTERING FEATURES FOR TIBETAN PART OF SPEECH TAGGING

Tibetan Part of Speech (POS) tagging, the foundation of Tibetan natural language processing, judges word classification according to contextual information of words. Based on the framework of the maximum entropy model, the paper studied the fusion of morphological features for Tibetan part of speech with maximum entropy model with the integration of word clustering features. Experimental result...


A Maximum Entropy Tagger with Unsupervised Hidden Markov Models

We describe a new tagging model where the states of a hidden Markov model (HMM) estimated by unsupervised learning are incorporated as the features in a maximum entropy model. Our method for exploiting unsupervised learning of a probabilistic model can reduce the cost of building taggers with no dictionary and a small annotated corpus. Experimental results on English POS tagging and Japanese wo...


An improved joint model: POS tagging and dependency parsing

Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of sentences, producing a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...


Maximum Entropy Based Bengali Part of Speech Tagging

Part of Speech (POS) tagging can be described as a task of doing automatic annotation of syntactic categories for each word in a text document. This paper presents a POS tagger for Bengali using the statistical Maximum Entropy (ME) model. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the various POS cl...



Journal title:

CoRR

Volume abs/1208.3047, Issue

Pages -

Publication date 2012
